Translingual Document Representations from Discriminative Projections
Authors
Abstract
Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corresponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.
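As a rough illustration of the OPCA idea described in the abstract, the sketch below finds projection directions that keep the variance of document vectors high while keeping the variance of the difference between comparable document pairs low, which amounts to a generalized eigenproblem. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `opca_projection`, the parameter `reg` (a ridge term added to the noise covariance for numerical stability), and the use of dense NumPy arrays over a shared vocabulary are all hypothetical choices made for illustration.

```python
import numpy as np

def opca_projection(docs_a, docs_b, k, reg=0.1):
    """Sketch of an OPCA-style projection (illustrative, not the paper's code).

    docs_a[i] and docs_b[i] are term vectors of a comparable document pair,
    expressed over one shared (concatenated) vocabulary of size d.
    Returns a (d, k) matrix whose columns are projection directions.
    """
    all_docs = np.vstack([docs_a, docs_b])
    centered = all_docs - all_docs.mean(axis=0)
    # "Signal": covariance of all document vectors.
    signal_cov = centered.T @ centered / len(all_docs)
    # "Noise": covariance of the differences between paired documents.
    diff = docs_a - docs_b
    noise_cov = diff.T @ diff / len(diff)
    # Assumption: a ridge term keeps the noise covariance positive definite.
    noise_cov = noise_cov + reg * np.eye(noise_cov.shape[0])

    # Solve signal_cov v = lambda * noise_cov v by Cholesky whitening:
    # with noise_cov = L L^T, take eigenvectors w of L^{-1} S L^{-T}.
    chol = np.linalg.cholesky(noise_cov)
    half = np.linalg.solve(chol, signal_cov)   # L^{-1} S
    whitened = np.linalg.solve(chol, half.T)   # L^{-1} S L^{-T}
    whitened = (whitened + whitened.T) / 2     # remove rounding asymmetry
    eigvals, eigvecs = np.linalg.eigh(whitened)
    top = np.argsort(eigvals)[::-1][:k]        # k largest signal-to-noise ratios
    # Map back to the original space: v = L^{-T} w.
    return np.linalg.solve(chol.T, eigvecs[:, top])
```

A document vector `x` would then be mapped to the shared space as `x @ V`, where `V` is the returned matrix, and retrieval or classification would operate on those projected vectors.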
Related resources
Discriminative models for robust image classification
A variety of real-world tasks involve the classification of images into pre-determined categories. Designing image classification algorithms that exhibit robustness to acquisition noise and image distortions, particularly when the available training data are insufficient to learn accurate models, is a significant challenge. This dissertation explores the development of discriminative models for...
Translingual Information Retrieval: Learning from Bilingual Corpora
Translingual information retrieval (TLIR) consists of providing a query in one language and searching document collections in one or more different languages. This paper introduces new TLIR methods and reports on comparative TLIR experiments with these new methods and with previously reported ones in a realistic setting. Methods fall into two categories: query translation and statistical-IR appr...
Rapidly Retargetable Interactive Translingual Retrieval
This paper describes a system for rapidly retargetable interactive translingual retrieval. Basic functionality can be achieved for a new document language in a single day, and further improvements require only a relatively modest additional investment. We applied the techniques first to search Chinese collections using English queries, and have successfully added French, German, and Italian docum...
Translingual Information Retrieval: Learning from Bilingual Corpora (AI Journal Special Issue: Best of IJCAI-97)
Translingual information retrieval (TLIR) consists of providing a query in one language and searching document collections in one or more different languages. This paper introduces new TLIR methods and reports on comparative TLIR experiments with these new methods and with previously reported ones in a realistic setting. Methods fall into two categories: query translation and statistical-IR appr...
Translingual Information Access
We present an attempt at a coherent vision of an end-to-end translingual information retrieval system. We begin by presenting a sample of the broad range of possibilities, and the results of some initial work comparing the different approaches. We then present an overall workstation architecture, followed by two possible approaches to the actual translingual IR stage presented in detail. Rankin...
Publication date: 2010